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Abstract— FASTEST-3D is an MPI-parallel finite-volume flow 
solver based on blocli-structured meshes that has been developed 
at the University of Erlangen-Nuremberg since the early 1990s. 
It can be used to solve the laminar or turbulent incompressible 
Navier-Stokes equations. Up to now its scalability was strongly 
limited by a rather rigid communication infrastructure, which 
led to a dominance of MPI time already at small process counts. 

This paper describes several optimizations to increase the 
performance, scalabiUty, and flexibflity of FASTEST-3D. First, 
a node-level performance analysis is carried out in order to 
pinpoint the main bottlenecks and identify sweet spots for 
energy-efficient execution. In addition, a single-precision ver- 
sion of the solver for the linear equation system arising from 
the discretization of the governing equations is devised, which 
significantly increases the single-core performance. Then the 
communication mechanisms in FASTEST-3D are analyzed and 
a new communication strategy based on non-blocking calls is 
implemented. Performance results with the revised version show 
significantly increased single-node performance and consider- 
ably improved communication patterns along with much better 
parallel scalability. In this context we discuss the concept of 
"acceptable parallel efficiency" and how it influences the real 
gain of the optimizations. Scaling measurements are carried out 
on a modern petascale system. The obtained improvements are 
of major importance for the use of FASTEST-3D on current 
high-performance computer clusters and will help to perform 
simulations with much higher spatial and temporal resolution to 
tackle turbulent flow in technical applications. 

I. Introduction and related work 

The numerical solution of turbulent, unsteady flow problems 
requires vast amounts of compute resources in order to get 
reasonable accuracy and time to solution. This problem can 
be tackled in two complementing ways: advanced numerical 
simulation algorithms and highly efficient, scalable implemen- 
tations. 

This paper deals with FASTEST-3D, a finite volume solver 
based on co-located, block-structured meshes. It originated 
from the Institute of Fluid Mechanics (LSTM) at the Univer- 
sity of Erlangen-Nuremberg is being developed since the early 
1990s. Today different versions of FASTEST-3D exist at the 
University of Erlangen-Nuremberg, the Technical University 
of Darmstadt, and the University of Freiberg. The code is used 
to solve the laminar or turbulent incompressible Navier-Stokes 



equations. Time evolution is based either on implicit schemes 
like Crank-Nicolson or on explicit low-storage multi-stage 
Runge-Kutta time advancement schemes. The linear equation 
system resulting from the discretization of the governing 
equations is solved using Stone's fl] strongly implicit method 
(SIP), which is based on an incomplete LU factorization. 
Several k-e and large-eddy simulation (EES) models are avail- 
able for the simulation of turbulent flow. Different coupling 
interfaces exists in order to simulate, e.g., fully coupled fluid- 
structure interaction 0, 13], one-way coupled acoustic inter- 
action |4|, and flow with chemical reactions. Parallelization 
in FASTEST-3D is based either on shared memory or on 
domain decomposition. The latter approach is also used for 
the MPI domains in the hybrid MPI/OpenMP version of 
FASTEST-3D. During the past years specific parallelization 
and optimization strategies have been implemented in order 
to improve the performance of FASTEST-3D on different 
high performance systems including mixed/shared-memory 
parallelization strategies ||5l, ||6|. Most of those optimizations 
have focused on vector computers with a low number of 
processors and a relatively high workload per process, hence 
the communication overhead was relatively small. This is 
different on massively parallel systems with a relatively small 
workload per process, which leads to increased communication 
overhead, causing poor parallel efficiency. In this paper we 
focus on the improvement of single-core (and thus single- 
node) performance by work reduction and the optional use of 
single-precision arithmetic, and on the optimization of com- 
munication performance by using non-bocking point-to-pint 
MPI functions, to the effect of a much-improved massively 
parallel scalability. 

This work is organized as follows: In Sect. |ll] we give an 
overview of the hard- and software used and the benchmark 



cases. Section III shows an analysis of the FASTEST-3D 



code in terms of function profiles and communication patterns. 



These results are used in Sect. IV for the optimization of 
single-core execution and the reduction of communication 
overhead. In Sect. [V] we demonstrate the achieved improve- 
ments in performance and scalability on a petascale-class 
system. Finally we give a summary and an outlook to future 
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work in Sect. |VI] 

II. Test bed and benchmark cases 

A. Test systems 

The scaling results in this paper were obtained on Super- 
MUC [7], a federal system installed at Leibniz Supercom- 
puting Centre (LRZ) in Garching, Germany. SuperMUC is 
available to scientists across Europe as a tier-0 system within 
PRACE (Partnership for Advanced Computing in Europe). 
It consists of 18 islands with 512 compute nodes each. A 
compute node comprises two Intel Xeon E5-2680 ("Sandy 
Bridge") eight-core processors, with an achievable aggregate 
memory bandwidth of about 80 GByte/s (as obtained with the 
standard STREAM benchmarks). The islands are fully non- 
blocking QDR InfiniBand fat trees, with a 4:1 oversubscribed 
fat tree across islands. 



The sequential function profiles in Sect. III-A were taken 
on the LiMa cluster at Erlangen Regional Computing Center 
(RRZE) at the University of Erlangen-Nuremberg. A compute 
node on LiMa consists of two Intel Xeon X5650 ("Westmere") 
six-core processors, with an achievable aggregate memory 
bandwidth of about 40 GByte/s. 

The Intel compiler in version 12.1 and the Intel MPI library 
in version 4.1 were used for all benchmarks on both systems. 

B. Test problems 

Three different test problems were considered for the pur- 
pose of analysis: 

I flow over a forward-facing step ||8( 
II turbulent flow in a plane channel, cf. Moser et al. ||9l 
III Taylor-Green Problem HOJ 

For test case I, the simulation of the flow over a forward- 
facing step, we used three different grids consisting of a total 
number of 12.5 • 10^ 280 • 10^ and about 2200 • 10^ control 
volumes, respectively. This case was used as an example of 
a real configuration with grid sizes according to real-world 
large-eddy (12.5 Mio. control volumes) and direct numerical 
simulations (280 and 2200 Mio. control volumes). Figure [T] 
shows the setup and Fig. |2] shows computed isosurfaces for 
this problem. Test case II was used for validation of both 




Fig. 2. Isosurfaces of turbulent pressure fluctuations for the flow over a 
forward-facing step computed via DNS; isovalues: 9/-9 Pa 



the non-blocking communication strategy and especially the 
single precision version of the linear equation solver This 
test considers the fully developed turbulent flow in a two- 
dimensional plane channel with periodic boundary conditions 
in stream- and span-wise directions and no-slip wall boundary 
conditions at the bottom and top wall. Test case III was 
also used for verification and to evaluate the single-socket 



performance (see Sect. IV-A below), since this problem is 
perfectly load-balanced including the boundary conditions 
(periodicity in x and y direction and symmetry at bottom and 
top). 

III. Systematic code analysis of FASTEST-3D 

As a first step for the implementation of different commu- 
nication, parallelization and optimization strategies, a function 
profile of the original version of FASTEST-3D was taken 
along with a basic investigation of its requirements towards the 
hardware (such as memory bandwidth) and its communication 
patterns. For this analysis we used the likwid tools |ll], the 
GNU profiler gprof, and the Intel Trace Analyzer and Collector 
d. 

A. Function profile 

Tables |l] and |ll] show function profiles for the serial version 
of FASTEST-3D using test case I (forward-facing step) with 
implicit and explicit time discretization, respectively. Only the 
ten subroutines consuming most of the total runtime are listed. 



In addition. Table III shows the profile for the explicit time 



discretization using six MPI processes on one socket of LiMa, 
to obtain a profile representative for parallel execution on the 
first bottleneck (the socket) without the adverse effects of com- 
munication overhead. The routines to solve the linear equation 
systems arising from the discretization of the Navier-Stokes 
equation are resforwardJd, lu3d and backwardJd. These rou- 
tines together consume between 45 % and 55 % of the total 
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time % 


self [s] 


calls 


name 


25.83 


56.48 


7500 


resforward3d 


22.53 


49.26 


945 


celuvw 


11.54 


25.22 


1200 


lu3d 


9.03 


19.74 


900 


celp2 


8.67 


18.96 


7500 


backward3d 


3.80 


8.31 


100 


calcp 


3.45 


7.54 


1916 


exvec 


2.96 


6.48 


300 


coefadd 


2.96 


6.47 


105 


caluvw 


2.32 


5.07 


205 


calcdp 



TABLE I 

Sequential function prohle, implicit time discretization, test 

CASE I 



time % 


self [s] 


calls 


name 


37.45 


125.03 


8100 


resforward3d_pp 


14.52 


48.47 


8100 


backward3d pp 


11.26 


37.59 


1800 


flxl 


9.01 


30.07 


645 


calcdp_exp 


6.78 


22.65 


15 


calcp_exp 


5.72 


19.09 


270 


celuvw_exp 


5.06 


16.89 


900 


lu3d_pp 


1.85 


6.19 


2400 


setval 


1.60 


5.33 


15 


runge_kutta 


1.45 


4.85 


3600 


flxnbl 



TABLE III 

Function profile using six MPI processes on one socket of LiMa 

FOR explicit time DISCRETIZATION OF TEST CASE 1 



time % 


self [s] 


calls 


name 


Routine 


Mem. BW [MByte/s] 


32.93 


26.93 


3600 


resforward3d 


Fastest-3D 


6212 


16.68 


13.64 


900 


flxl 


calcpexp 


10894 


11.26 


9.21 


3600 


backward3d 


calcp_exp (2) 


6728 


9.34 


7.64 


135 


celuvw_exp 


flxl 


3845 


7.74 


6.33 


300 


lu3d 


calcdp_exp 


12570 


6.49 


5.31 


5 


calcp_exp 


calcdp exp(2) 


13334 


5.09 


4.16 


215 


calcdpexp 


lu3d 


4204 


1.77 


1.45 


1529 


exsca 


backward3D 


7025 


1.47 


1.20 


5 


runge_kutta 


resforwardSD 


7116 


1.22 


1.00 


1800 


flxnbl 


celuvw_exp 


2490 



TABLE II 

Sequential function Profile, explicit time discretization, test 

CASE I 



TABLE IV 

PER-ROUTINE analysis of memory bandwidth for the SERIAL 

version (test case 1) ON LiMa 



execution time. This fraction depends on the settings for the 
linear equation solver and on the time discretization scheme 
used: In the case of an explicit time discretization only one 
equation for the pressure correction has to be solved, while in 
the implicit time discretization equation systems for the three 
Cartesian components of the momentum equations have to be 
solved in addition. Another factor is the linear equation solver, 
i.e., the number of iterations per correction step. However, 
based on this analysis one can identify the linear equation 
solver as the most time-consuming part for both parallel and 
serial execution. Hence, any serial code optimization should 
first try to improve the performance of this part of FASTEST- 
3D. One issue which can not be seen from the profile tables 
is that the arrays used to store the coefficients of the linear 
equation system are reused and overwritten by other, unrelated 
subroutines of FASTEST-3D. Hence, even if the coefficients 
for the equations do not change (as for the pressure correction 
equation in the explicit time discretization), the coefficients 
have to be recalculated. Since the explicit time discretization 
is of special interest for DNS and LBS, a redesign of the solver 
should avoid unnecessary recalculation of the coefficients. 
Finally, it should be mentioned that the equation system for the 
pressure correction equation is symmetric, but this symmetry 
has not been exploited so far. 

Another possibility to gain more insight is the use of 
likwid-perfctr |fl], a simple command-line tool to measure 
hardware performance metrics on x86 processors under Linux. 
It also features a lightweight API which enables restricting the 
measurements to certain parts of the code. This allows, e.g., 
to get the memory bandwidth used while executing a specific 



loop in the code. The result for a serial run of FASTEST- 



3D is shown in Tab. IV In principle, by this analysis a piece 
of code can be identified as being either memory bound or 
compute bound. Note that a single core cannot saturate the 
memory bandwidth of a socket on modern multicore chips 
even when running strongly memory-bound code llT3l . for 
which FASTEST-3D is a typical example. It can be seen from 



Tab. IV that the overall memory bandwidth is in the range of 



6GByte/s using the serial version, with individual routines 
drawing between 2.5GByte/s and 13GByte/s. The latter is 
also roughly the maximum bandwidth obtained by a simple 
sequential streaming benchmark such as STREAM. With 6 
Threads on one socket of LiMa, an average bandwidth of over 
15GByte/s for the whole application can be reached, which 
is close to the limit of 20GByte/s. Multiplying the values of 
the memory bandwidth in Tab. [iVjwith the number of threads, 
most of the measured code sections would easily exceed the 
maximal available memory bandwidth. As a consequence, 
FASTEST-3D shows the typical saturation behavior of a 



bandwidth-limited code on a multicore chip. See Sect. IV-A 
below for some further analysis. 

B. Communication pattern 

An analysis of the communication pattern has been per- 
formed using Intel Trace Analyzer and Collector (ITAC). 
Typical screen shots of a run with the original code for test 
case II are shown in Fig.|3]for a total number of 96 processes. 
A single process per node was run in this experiment so that 
intra-socket saturation effects could be ruled out. As can be 
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seen from Fig. |3(a)| more time is spent for communication 
(or rather within the MPI library) than for computation. The 
number of control volumes (CVs) per process for this problem 
is 65536, which is a realistic number for a production run of 
FASTEST-3D, given enough parallel resources. In a produc- 
tion environment one would require a minimum acceptable 
parallel efficiency of 50 % (single node baseline) at problem 
sizes of approximately 50000 CVs per process. The parallel 
efficiency is defined as 

where Tg and Tp{N) are the execution times on a single 
node and on N nodes, respectively. Since the simulations can 
not always be run on a single node due to the problem not 
fitting into main memory, an alternative version for the parallel 
efficiency will also been used below, where the baseline is not 
a single node but M nodes: 



i-p,M 



iN) = 



(2) 



NTp{NM) ' 

Using this definition, the base configuration already contains 
a certain amount of (inter-node) parallel overhead. 

Based on the measurements obtained from ITAC, the par- 
allel efficiency for this configuration is already below the 
acceptable limit of 50%. From Fig. 3(b) it can be seen 
that the blocking MPI Recv and MPI Send functions are 
used for point-to-point communication between adjacent sub- 
domains/processes. While some processes return relatively 
quickly from the call to MPI Recv, others spend much longer 
in the call until a matching MPI Send has been issued. This 
leads to a partial serialization of the communication, since the 
blocking calls to send/receive functions are issued sequentially 
on each process, and successive calls can not be issued until 
the current call returns. This kind of partial communication 
serialization is a well-known pattern llT4l that may be over- 
come by using non-blocking point-to-point send/receive calls 
(MPI Isend/MPI Irecv). Also a large part of the communica- 
tion time is lost in the collective function MPI_Allreduce. This 
is a consequence of the artificial load imbalance caused by the 
serialization described above. Moreover, reduction operations 
are mainly used in FASTEST-3D in order to compute the 
residual. By keeping those evaluations to a minimum, the 
overhead from MPI Allreduce can be reduced substantially. 

IV. Code optimization 

Using the results of the code analysis and the conclusions 
drawn from it, a revised version of FASTEST-3D was devel- 
oped. In this section we give a short overview of the major 
changes that were implemented to improve the single-core 
performance and the scaling properties of FASTEST-3D on 
multicore systems. The correctness of the improved code is 
shown for the important case of a turbulent, fully developed 
channel flow. 

A. Improving the single-core performance 

From the measurements of the memory bandwidth and the 
runtime profiles it was concluded that a relatively large part 
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(c) Distribution of communication time on different MPI functions 

Fig. 3. ITAC analysis: Overview, event timeline, and time spent per MPI 
routine using one MPI process per node on the LiMa cluster for test case II 
and 96 processes 



of the overall runtime was spent solving the linear equation 
system and that FASTEST-3D is a mostly memory-bound 
application. This is true for both explicit and implicit time 
discretization, but the focus will be in the following on the 
explicit time discretization, which is of special interest for 
this work and direct numerical simulations performed with 
FASTEST-3D at the moment. Three possibilities to improve 
the performance of the explicit time discretization have been 
identified: 

1) avoiding unnecessary re-computation of the coefficients 
of the pressure correction equation or the incomplete 
factorization matrices (ILU) constructed in Stone's im- 
plicit method 1 1 1 

2) exploiting the symmetry of the pressure correction equa- 



5 



tion 

3) using single precision to solve the linear equation system 
Regarding the first point, the use of an implicit time discretiza- 
tion scheme was a very common use case for FASTEST-3D, 
and hence three equation systems resulting for the momentum 
and one equation system resulting for the pressure correction 
have to be solved. This is done in a segregated fashion in 
FASTEST-3D. By this the coefficient arrays can be shared by 
all equations and thus are overwritten by each other For the 
explicit time discretization this is not the case, since only one 
equation system is considered. Furthermore, the coefficients of 
this equation system do not change and need to be calculated 
only once. Consequently this is also true for the ILU factoriza- 
tion. Once computed, the resulting decomposition can be used 
for all iterations and time steps without the need to recompute 
it again. This fact was not exploited in the original version of 
FASTEST-3D. 

Concerning the second point, only the equation system for 
the pressure correction equation is symmetric, since the con- 
vective operator in the momentum equation is discretized using 
a deferred correction method and a blend of central and up- 
wind interpolation schemes yielding non-symmetric equation 
systems. Hence, exploiting this property is most advantageous 
for the explicit time discretization and the adoption for non- 
symmetric equation systems has still to be investigated but is 
not the focus of the present work. Taking advantage of this 
symmetry in the pressure correction equation is expected to 
be beneficial, since the subroutine in question, resforward3d, 
is memory bound. 

The third measure implemented to improve the single core 
performance is beneficial for both impUcit and explicit time 
discretization without restrictions and reduces the amount 
of data which has to be loaded while solving the equation 
system. Since the rest of the algorithm is still performed in 
double precision, an additional overhead for data conversion is 
generated, slightly reducing the possible gain in performance. 

All of the above strategies were implemented in a redesign 
of the linear equation solver and have changed the way the 
solver had to be interfaced. Figures]?] and |5] show the structure 
and call sequence of the SIP solver before and after the 
redesign. In the original version, a call to the SIP solver always 
implied a call to the subroutine for ILU factorization. Then the 
linear equation system was relaxed until a given number of 
iterations was reached. Finally, the pressure and velocity field 
was corrected using the computed pressure correction and the 
outer loop was checked for convergence, i.e., if the error in 
mass conservation is below a given threshold. 

The implementation of the new version was based on a 
modular concept. The solver allocates and deallocates memory 
for the coefficients, which are not overwritten any more and 
thus do not need to be recomputed in between. Furthermore, 
the computation of the ILU factorization is done in advance 
to the actual solver routine. Before performing forward and 
backward substitution, the right hand side and solution vectors 
have to be converted from double to single precision. After the 
relaxation, only the solution vector is converted back to double 
precision for the subsequent pressure and velocity correction. 
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Fig. 4. Original version of SIP solver 
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Fig. 5. Optimized version of SIP solver 



In order to take the symmetry of the linear equation system 
into account, the calculation of the residual, which is part of 
the forward substitution step, was changed so that only the 
values of the east, north and top coefficients are used (see 
Listing ]T]i. 
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Listing 1. Residual calculation for a symmetric system matrix 

do k = 2, nk-1 
do i = 2, ni— 1 
do j = 2, nj-1 
res (j , i ,k) = 

ae(j,i,k) * fi(j,i+l,k) 



Fastest 1 /Runtime vs. Cores, Sandy Bridge 



(j ,i-l,k) 
+ l,i ,k) 
(j ,k) 
, i ,k+l) 
i ,k-l) 
k) * fi(j 



(j 



+ ae (j , i — l,k) * fi 
+ an ( j , i , k) * f i ( 
+ an(j— l,i,k) * fi 
+ at ( j , i , k) * f i ( 
+ at (j , i ,k — 1) * fi 

+ su(j,i,k) - ap(j,i,k) * fi(j,i,k) 
enddo 
enddo 
enddo 

The effect of the different single-core optimizations has 
been measured on one socket of SuperMUC. The dashed 
line in Fig. |6] shows the inverse runtime per timestep of the 
code version using non-blocking communication (see the next 
section) and the original version of the SIP solver The solid 
line shows the performance of the non-blocking code version 
in combination with all single core optimizations applied in 
the new version of the SIP solver To evaluate the influence 
of the single core optimizations, additional measurements for 
the same case running with eight MPI processes were taken. 
They are shown with two additional scattered data points in 
Fig. |6] By performing the relaxation of the equation system 
using single precision arrays, a reduction in runtime by 25 % is 
achieved. This is in good agreement with the function profile 



shown in Tab. Ill where the two subroutines resforwardSd 
and backwardSd consume about 52 % of the overall runtime at 
double precision. Reducing the size of the data set by a factor 
of two leads to a speedup of two for those memory-bound 
routines. Avoiding the recomputation of the ILU factorization 
yields another reduction by 7 %, again in agreement with 



the function profile (see Tab. IIIi measured on LiMa. The 
last change, exploiting the symmetry in the system matrix, 
improves runtime by another 5 %. The solid line in Fig. [6] 
combines all optimizations. 

Only a slight increase in performance can be gained using 
eight instead of four cores on one socket. However, this also 
indicates the FASTEST-3D is not completely memory bound, 
since some parts of the code still benefit from the additional 
number of cores. 

B. Improving communication performance 

The data exchange at block boundaries in FASTEST-3D 
is based on transfer tables. These tables are set up in a 
preprocessing step and stored in a binary file which is read 
by FASTEST-3D after a simulation is started. Each process 
stores only information about the send/receive operations it 
is involved in. Also no separate tables for process-local and 
remote data transfer are present. Finally, all communication 
routines are wrappers around the basic MPI routines, providing 
a consistent interface no matter if the underlying communica- 
tion library is MPI, no communication (in case of a serial run). 
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Fig. 6. Scaling on one socket of an Intel Xeon Sandy Bridge node, non- 
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Fig. 7. Blocking data exchange at block boundaries in FASTEST-3D. 



or Parallel Virtual Machine (PVM). In addition, FASTEST- 
3D uses different exchange routines for scalars, vectors, and 
tensors. Figure |7] shows the basic flow of the blocking data 
exchange in the original version: Each process performs a 
loop over all its patches (block boundaries) that need to 
exchange data with their neighboring blocks, regardless of 
whether the neighboring block is located on the same process. 
If the patch between two neighboring blocks is on the same 
process, no MPI routines are called and only a local copy 
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operation is performed. Otherwise the data is sent to the 
neighboring process. Since MPI Send is a blocking call, the 
sending process has to wait until the data is received at least 
for larger messages. This leads to a partial serialization, as 
described above, and also applies to data copied to diagonal 
neighbors into block edges and corner points. 

In order to perform data exchange based on non-blocking 
MPI calls a module was created that, after an initialization 
process, sets up derived data transfer tables from the original 
tables by splitting these in data transfers between blocks on 
the same process and remote data transfers for remote sends 
and remote receives. For each kind of transfer an according 
exchange routine was provided. To ensure that all data is 
received by the correct target process the MPI message tag 
is used, which is incremented after each data transfer at block 
boundaries. In addition we exploit the property that if process 
A sends two or more messages to process B, the messages 
arrive in the same order as they were sent. As can be seen in 
Fig. [8] first the calls to MPI Irecv are made, than all send 
statements are issued, and finally MPI Waitall is called to 
ensure that all outstanding non-blocking sends and receives 
are finished. In a final step the data is copied from the receive 
buffers to the arrays which store the variables. In contrast to 
the original algorithm no data which was sent is available in 
the arrays which store the actual data since all data remains 
in the receive buffer until all send and receive statements have 
returned. Hence, the data at the block comer points and block 
edges is always data from the previous exchange operation. 
As long as all variables converge and a sufficient number of 
operations is done, this is no problem since the difference 
between variables from two consecutive exchange operations 
is very small at convergence. Also, the data at comers/edges 
is only needed for the solution of the equation system for 
the diffusive fluxes on the right hand side of the momentum 
equation. In order to have all data correctly exchanged at the 
same time level, a three step approach would be necessary, 
first exchanging corner points, then edges, and finally the 
inner points of block patches. After each exchange step a call 
to MPI Waitall would also be required, resulting in complex 
exchange routines and preprocessing as well as in additional 
overhead. 
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Fig. 8. Non-Blocking data exchange at block boundaries in FASTEST-3D. 
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C. Verification 

In order to verify the correctness of the new implementation, 
test problem II is compared to the original implementation. 
All three versions (original, with non-blocking calls, and non- 
blocking plus single-core optimizations) were used to compute 
the mean velocity profile for a turbulent channel flow (see 
Fig. [9|. The velocity in x direction (JJ) and the distance from 
the wall (y) are normalized by the friction velocity = 
with dynamic viscosity /i, and the kinematic viscosity divided 
by the friction velocity, respectively, which is indicated by 
the superscript +. All three variants are very close to each 
other and no clear advantage can be seen for either of them in 
terms of the quality of the solution. Compared to the DNS 
data of Moser et al. |9| a small deviation in the log-law 
region for all the three variants can be observed; this may 



Fig. 9. Comparison of original, non-blocking, and non-blocking version with 
single core performance optimizations of FASTEST-3D with respect to DNS 
data of Moser et al. |^9J 



be due to the lower approximation order for the derivatives 
in the governing equations compared to Moser et al. fO^j, who 
used highly accurate spectral methods while keeping the same 
grid sizes. On the other hand, the FASTEST-3D version with 
non-blocking calls and a single-precision solver performs best 
using the same hardware with respect to the time to solution. 

V. Performance AND Scaling Results 

In this section we present selected strong scaling measure- 
ments on SuperMUC for different code versions of FASTEST- 
3D with 16 processes per node (ppn). Results are reported for 
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the reciprocal of the averaged time spent per time step (l/T^) 
with runtime per time step Tr and parallel efficiency following 
the definition in Eq. (|2]i. Performance is plotted vs. the number 
of compute nodes, since this is the smallest allocatable unit on 
any modem cluster system. All results presented here are based 
on test case I, the flow over the forward facing step, which 
is a real-world problem and not only useful for benchmarking 
purposes. 

A. Scaling Results on SuperMUC 

Scaling measurements were performed for domain sizes of 
786432 control volumes per process down to 12288 for the 
coarsest grid, 387072 down to 24192 control volumes for the 
grid consisting of 280 Mio. control volumes and from 1.5 Mio. 
down to 193536 control volumes per process for the largest 
grid. The inverse of the averaged runtime per timestep and 
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the parallel efficiency are shown in Figures 10 and Fig. 1 1 for 
the blocking and non-blocking versions of FASTEST-3D at 
the smallest grid size. Dashed lines correspond to the original 
code version with blocking MPI calls and the original version 
of the SIP solver Solid lines denote the version with non- 
blocking MPI calls, and dotted lines correspond to the code 
version using non-blocking communication and the improved 



SIP solver. As can be seen from Fig. 10 the best performance 



is obtained for the non-blocking version with the improved 
SIP solver, but the non-blocking version with the original SIP 
solver shows better scaling because the slower linear system 
solver spends more time performing calculations between two 
subsequent communication steps. Using the optimized version 
of the SIP solver along with non-blocking communication, 
nearly the same performance as for the non-blocking commu- 
nication with the original SIP solver can be obtained using 
only half of the resources (32 vs. 64 nodes). Hence, if double 
precision is not necessary, using the optimized SIP solver 
along with non-blocking communication is the best choice 
in terms of overall cost since less resources for nearly the 
same time to solution are needed. The horizontal solid line in 
Fig. [TT] denotes the threshold of minimum acceptable parallel 
efficiency (50 %). 

Obviously the optimized version of the SIP solver is not 
efficient beyond 32 nodes at this small problem size. It is 
remarkable that the original version of FASTEST-3D is not 
efficient at all even for very small node counts (> 8), which 
is due to its severe communication inefficiencies. Comparing 
the performance at the limit of acceptable efficiency of 50 %, 
the new version performs about ten times better for the test 
problem. We regard this as the more relevant gain, as opposed 
to the factor of 5-6 achieved when comparing both versions 
at the same number of nodes (but vastly different parallel 
efficiencies). Whether 50% is the right threshold to apply 
is certainly debatable, but the overall conclusions are largely 
independent of the actual limit. 

In addition to the scaling runs at a moderate problem size, 
additional measurements for an existing setup of a direct 
numerical simulation (DNS) with approximately 280 • 10^ 
control volumes were done on SuperMUC comparing the new 
version with the original version of FASTEST-3D for a real 
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Fig. 10. Inverse runtime per timestep vs. nodes on SuperMUC, 16 ppn, test 
case I, 12.5 ■ 10'' CVs, strong scaling 
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simulation. Scaling results are shown in Figures 12 and 13 



Note that, in contrast to the measurements for the coarse grid, 
scaling has been performed here for more than 512 nodes. 



which means that for the last data point in Figures 12 and 13 
two network islands have been used, leading to a considerable 
drop in bisection bandwidth per node. Results are reported 
for the code version using non-blocking communication with 
the original SIP solver (solid line) and the original code ver- 
sion based on blocking communication (dashed line). Again, 
considering the inverse runtime, the code version using non- 
blocking communication shows much better parallel scaling. 
While for the non-blocking version the parallel efficiency 
drops below 50 % for more than 500 nodes, this is the case for 
the version using blocking calls already for about 180 nodes 
(see Fig [T3] l. Comparing both versions at approximately 50 % 
parallel efficiency, the non-blocking version is about 8 times 
faster using only twice the amount of resources at the same 
level of parallel efficiency. 



Fig. 12 also shows two additional scattered data points 
denoting the performance using a naive core-rank mapping 
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Fig. 12. Inverse runtime per time step vs. nodes on SuperMUC, 16 ppn, test Fig. 14. Inverse runtime per time step vs. nodes on SuperMUC, 16 ppn, test 
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case I, 2.2 ■ 10^ CVs, strong scaling 
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Fig. 13. Parallel efficiency on SuperMUC vs. nodes, 16 ppn, test case I, Fig. 15. Parallel efficiency on SuperMUC vs. nodes, 16 ppn, test case I, 
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which does not take the neighborhood relation between paral- 
lel subdomains for MPI task placement into account. In case of 
the non-blocking communication, neglecting the neighborhood 
relations causes a drop in performance of about 15 %. For the 
version using blocking communication this effect does not play 
any role since the parallel efficiency for the last data point is 
already very low. 

Finally, Figures [l4| and [TS] show performance and efficiency 
data for the largest set of 2.2 • 10^ CVs. The optimized code 
version at double precision can scale efficiently to node counts 
which are unreachable by the original code (beyond 1400 
nodes, i.e., 22400 cores). The single-precision SIP solver can 
still be of use even at these large node counts. Note that the 
parallel efficiency shown in Fig. 15 is relative to a 180-node 



baseline due to the size of the problem. 



VI. Conclusions and Outlook 

We have implemented several optimizations in FASTEST- 
3D, a parallel finite-volume flow solver based on co-located. 



block-structured meshes, to improve its sequential perfor- 
mance and parallel scalability on modem multicore systems. 
By work-avoiding strategies and employing a single-precision 
linear solver, the single-node performance could be improved 
by about 40 %. At the MPI-parallel level, employing non- 
blocking MPI for block boundary exchange resulted in a mas- 
sive improvement of strong parallel scalability, not because of 
overlapping communication with computation but due to elim- 
inating partial communication serialization. For the purpose 
of comparing the cost of computations we have introduced 
the concept of "minimum acceptable parallel efficiency." At 
an efficiency limit of 50 % we could achieve between 8x 
and lOx speedup over the original version, depending on the 
problem size. The code in its optimized form is now ready 
for massively parallel, strongly scaled simulations on current 
high-performance cluster platforms. 

Future work may include a more thorough study of the 
effects of rank-core mapping on the scaling properties, and 
a detailed analysis of the hybrid (MPI-nOpenMP) version of 
FASTEST-3D, which is expected to show different commu- 



nication and convergence behavior compared to the pure MPI 
version studied here. 
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